This work addresses the problem of recovering lost or damaged satellite 
image pixels (gaps) caused by sensor processing errors or by natural 
phenomena such as cloud presence. Such errors decrease our ability to monitor 
regions of interest and significantly increase the average revisit time for all 
satellites. This paper presents a novel neural system based on conditional 
deep generative adversarial networks (cGAN) optimized to fill satellite 
imagery gaps using surrounding pixel values and static high-resolution visual 
priors. Experimental results show that the proposed system outperforms 
traditional and neural network baselines. It achieves a normalized least 
absolute deviations error of $\mathcal{L}_1 = 0.33$ (21% and 60% decrease in error 
compared with the two baselines) and a mean squared error loss of 
$\mathcal{L}_2 = 0.15$ (29% and 73% decrease in error) over the test set. The model can 
be deployed within a remote sensing data pipeline to reconstruct missing 
pixel measurements for near-real-time monitoring and inference purposes, 
thus empowering policymakers and users to make environmentally informed 
decisions. 
1. INTRODUCTION 

Climate change poses serious challenges that threaten humanity's long-term safety [1]. Addressing 
these challenges depends on breakthroughs in environmental policy and climate science [2]. Climate research 
is key to understanding the long-term effects of global warming on agriculture [3-4], food security [5], air 
quality [6], and weather conditions [7]. One of the primary data sources that empower 
environmental research is satellite imagery [8]. Satellite programs such as Landsat [9], Sentinel, and Aqua/Terra, 
among others, provide a wealth of freely available data sets. This tremendous progress has 
unlocked many innovations that extract valuable insights [10-11] from satellite imagery using big data 
pipelines and advanced machine learning systems [12]. 

Satellite sensors are limited by their spatio-temporal resolution. A satellite's temporal resolution 
is the time between two successive observations of the same point on Earth, whereas its spatial 
resolution specifies the surface area covered by 1 pixel of information (e.g., Sentinel-2 has an RGB spatial resolution of 
10 x 10 m per pixel). For various reasons, most satellite imagery contains "holes" or "gaps" of missing 
pixel values. Clouds are the primary contributor to such noise. Satellite noise is challenging because it 
worsens the satellite's effective temporal resolution and introduces uncertainty into atmospheric monitoring pipelines. 
Many have resorted to using IoT sensors [13] that provide a higher-quality ground-level stream of 
measurements [14]. However, ground sensors can only give information about a specific location and, as a 
result, do not have the geographic coverage that satellites have (most polar satellites cover the whole Earth). 

Remote sensors are the primary data source for large-scale atmospheric monitoring and, more 
specifically, air quality monitoring. Enhancing the spatio-temporal resolution of satellite sensors is of critical 
importance since it enables greater visibility over the state of planet Earth. For this reason, this study focuses 
on inpainting satellite NO2 images. Each NO2 pixel measures the atmospheric NO2 vertical density (in 
Dobson units) over the pixel. NO2 is a trace gas that negatively affects air quality and the climate. It is linked 
to road traffic and industrial activities such as fossil fuel combustion [15]. A high NO2 concentration can 
cause numerous respiratory diseases [16]. 

This paper proposes a generative adversarial system used to fill the missing gaps in images based on 
the image's content (available pixel values) and high-resolution visual priors. The system's novelty lies in its 
use of a different data modality (higher-resolution static RGB images) encoded by a conditional layer to 
provide auxiliary features to the completor network. As a result, the neural system inpaints all future images 
and pushes the sensor's temporal resolution to its theoretical limit (nullifying the effects of clouds or sensory 
errors). The paper's main contributions are as follows: 

— A cGAN-based neural system for inpainting multi-spectral satellite imagery. 

— A full description of the pre-inference data preprocessing pipeline. 

— Case study: the method is evaluated on NO, pollution images for near-real-time air quality monitoring, 
showcasing the potential of fusing multi-modal satellite data using neural approximators. 

The rest of the paper is structured as follows. "Related works" describes the most notable research efforts 
that tackle image inpainting. "Research method" introduces the neural system architecture, the training 
algorithm, and the data set. "Results and discussion" describes synthetic noise generation, introduces the 
performance metrics, and presents the final results. Finally, it provides an intuitive understanding of the 
effects of priors and their limitations. "Conclusion" summarizes the paper and describes future work. 


2. RELATED WORKS 

The existing research literature on image inpainting can be grouped into two main families: non-learning 
methods, such as diffusion- and patch-based algorithms, and the relatively recent work that attempts to 
learn inpainting by training convolutional neural network (CNN) based architectures. This section outlines 
the most notable efforts from both sides. 


2.1. Diffusion or patch-based methods 

The early success in image inpainting is attributed to information propagation techniques through 
patch similarity or variational methods. Efros and Leung [17] proposed to model image textures as Markov 
random fields and then use similarity search to fill the missing pixels. Other efforts [18-19] were directed toward 
inpainting images through their texture and structure using search-based guided propagation that synthesizes 
patterns resembling other image regions or other images within a searchable database. 

Variational methods, such as [20], use feature extractors based on patch statistics, 
colors, and gradients to synthesize the missing image regions. Lastly, out-of-sample inpainting was achieved by 
[21] using an extensive database of images. Its algorithm inpaints missing regions of an image by finding 
similar images and then diffusing the extracted low-level features. Unlike the others, this technique can suggest multiple 
completions based on the chosen database item. 

This class of methods works well on images that contain repeated or static patterns (examples: sand, 
grid, paper) but fails on images with rich semantic content. Furthermore, automatic non-learning algorithms 
cannot inpaint abstractions that make complex images cohesive in their content, and their use of out-of- 
sample information is limited due to their local dependencies. 


2.2. Learning-based approaches 

One of the earliest efforts to use representation learning for image inpainting proposed a multi-layer 
perceptron (MLP) architecture to fill missing pixels in gray-scale images by minimizing the reconstruction 
loss [22]. The paper established the importance of masking missing pixels and the potential of neural 
networks (NNs) in image completion. Furthermore, Xu et al. [23] used a CNN architecture to propose a 
general method for solving three tasks: image inpainting, denoising, and image degradation recovery. 

Recently, neural networks trained using pixel-wise reconstruction error and adversarial loss reported 
promising results. The work of [24] introduced context encoders to fill large holes in image centers. 
Yang et al. [25] enabled high-resolution image inpainting by proposing joint content and texture losses. 
Xu et al. [26] combined local and global discriminators into one network and used convolutions and dilated 
convolutions to inpaint images. 
Yu et al. [27] improved the previous architecture by dividing the generation process into two stages. 
The first outputs a blurry image optimized with a spatially discounted $\mathcal{L}_1$ reconstruction loss, and the second 
refines and outputs the final image. The authors used the network's output as input to the global and local 
discriminators and chose Wasserstein GANs (WGAN) to train the neural system (WGAN stabilizes the 
overall optimization process). Finally, Xu et al. [28] improved the previous architecture by incorporating 
contextual attention and dilated gated convolutions into both the coarse and refinement networks. 

Although the mentioned neural systems provide impressive inpaintings and predict high-quality 
visual semantics, none have experimented with priors or extended the generator/discriminator with a 
conditional layer. Furthermore, all of the mentioned methods assume one source data distribution to be 
modeled, as shown in Table 1. This study establishes the importance of using different data modalities and 
fusing them through a conditional layer to solve image inpainting in general, and air quality estimation 
specifically. 


Table 1. A comparison of different algorithms that aim to solve image inpainting 
Features\Methods | Non-Parametric Sampler [17] | PatchMatch [29] | Mask-FC Inpainter [22] | LocalGlobal [26] | Ours 


Free-Form JV JS Sf of if 
Out-of-Sample of if of ey 
Semantics PY J es 
Multi-Modal V 


3. RESEARCH METHOD 

Two CNN-based networks were trained within a conditional adversarial framework, as 
shown in Figure 1: a generator network, responsible for filling the missing gaps using contextual 
information and static priors, and an auxiliary discriminator network, trained to distinguish between real and 
completed pollution patches. Both networks are conditioned on true-color imagery that corresponds to the 
region covering the input patch. The prior is encoded by a conditional layer (reducer). The input to the 
generator consists of a damaged image (x) and its high-resolution prior (p). The reducer network compresses 
p to the same size as x and then stacks both for inpainting. The discriminator network takes either a healthy or a 
completed image together with its encoded prior and judges whether the image is real or completed. 
Figure 1. System overview 


3.1. Convolutional neural networks 

The reducer, completor, and discriminator networks are based on convolutional neural networks 
(CNN). CNNs are a special type of neural network that uses weight sharing to extract hierarchical visual 
features with minimal free parameters and maximal local connections. Kernel weights are optimized to 
produce activations that help in the final prediction task. CNNs are capable of progressively extracting 
higher-order abstractions that serve to minimize a pre-defined objective function. A specific activation is 
calculated using (1). 


$$Y_{m,n} = \sigma\Big(b + \sum_{i=0}^{s-1} \sum_{j=0}^{s-1} K_{i,j} X_{m+i,n+j}\Big) \tag{1}$$


Here, X represents the input, K the kernel (a matrix of learnable weights), b the bias, $\sigma$ the activation function, s the kernel size, and 
(m, n) the indices of the target value $Y_{m,n}$ in the activation layer. Dilated convolutional layers are also 
used to simulate a larger receptive field without adding more parameters. To calculate dilated activations, one 
parameter is added to the previous definition: $\eta$, the dilation factor, as in (2). 


$$Y_{m,n} = \sigma\Big(b + \sum_{i=0}^{s-1} \sum_{j=0}^{s-1} K_{i,j} X_{m+\eta i,n+\eta j}\Big) \tag{2}$$
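
The two activations can be checked numerically with a direct, unoptimized transcription of (1) and (2). The following is a minimal sketch for a single channel; the function name `conv_activation`, the ReLU default for the activation, and the absence of padding are assumptions made for illustration only.

```python
import numpy as np

def conv_activation(X, K, b=0.0, eta=1, sigma=lambda v: np.maximum(v, 0.0)):
    """Single-channel activation map following (1) and (2), without padding.

    eta=1 gives the plain convolution of (1); eta>1 gives the dilated form (2).
    The ReLU default for sigma is an assumption, not taken from the equations.
    """
    s = K.shape[0]                      # kernel size (square kernel assumed)
    span = eta * (s - 1)                # extent covered by the dilated kernel
    H, W = X.shape
    Y = np.empty((H - span, W - span))
    for m in range(H - span):
        for n in range(W - span):
            acc = b
            for i in range(s):
                for j in range(s):
                    acc += K[i, j] * X[m + eta * i, n + eta * j]
            Y[m, n] = sigma(acc)
    return Y

X = np.arange(25, dtype=float).reshape(5, 5)
K = np.ones((2, 2))
plain = conv_activation(X, K)            # shape (4, 4)
dilated = conv_activation(X, K, eta=2)   # shape (3, 3), same four weights
```

With the same four kernel weights, the dilated call reads pixels two steps apart, which is how the receptive field grows without extra parameters.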


3.2. Conditional generative adversarial networks 

Generative adversarial networks (GAN) are a class of neural networks trained in an adversarial 
manner. A GAN consists of two networks: a generative network G(.) that learns the true data distribution 
(the process that generated the training data) and a discriminative network D(.) that estimates if a sample 
came from the true data distribution or from G(.). Ideally, both G and D are trained simultaneously: G's parameters 
are adjusted to minimize $\log(1 - D(G(z)))$ (i.e., to fool the discriminator), and D's parameters are tuned to 
maximize $\log(D(x))$ (i.e., to detect fake generated inputs). D and G play the following two-player minimax 
game with value function V(G, D) as in (3). 


$$\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{3}$$
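
As an illustration of (3), the value function can be estimated from a batch of discriminator outputs. This is a minimal sketch; the helper name `gan_value` and the clipping constant `eps` are assumptions added for numerical safety and are not part of the formulation.

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of V(G, D) in (3).

    d_real holds D(x) on real samples, d_fake holds D(G(z)) on generated ones.
    """
    d_real = np.clip(np.asarray(d_real, dtype=float), eps, 1.0 - eps)
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1.0 - eps)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

undecided = gan_value([0.5, 0.5], [0.5, 0.5])   # equals -2 * log(2)
confident = gan_value([0.9, 0.9], [0.1, 0.1])   # larger: D separates the two sets
```

The discriminator pushes this value up by separating real from generated samples, while the generator pushes it down by making D(G(z)) approach D(x).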


In the context of this study, visual priors are of higher resolution than pollution patches. 
Downsampling visual imagery to fit pollution patches will result in losing much of its encoded information. 
Additionally, from the perspective of a vanilla GAN (i.e., an unconditioned GAN), there is no control over 
data modes during the inpainting process. Inputting pollution images without pixel-level meta-data will result 
in a model that mimics a general-purpose interpolator. However, by conditioning the output over its region's 
visual imagery, the model can produce accurate inpaintings by finding correlations between priors and 
pollution patches. 

As a result, the generative adversarial network is extended with a conditional layer to encode the 
static priors. The generator and discriminator are provided with high-resolution encoded imagery (priors: p). 
The objective function of the two-player minimax game is updated as (4). 


$$\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x|p)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z|p)|p))] \tag{4}$$


In a conditional generative adversarial setup, the same condition is provided to both the generator 
and discriminator networks. Priors are purposely used as conditions to help the generator enhance its 
completions. For example, one can imagine how useful vehicle traffic density images would be for a model 
that predicts near-real-time NO2 concentrations. In this case, RGB images provide low-level information 
about urban and greenness densities. This study argues that a visual prior could be useful to the task of 
predicting NO2 densities over large regions of interest. 


3.3. Completion network 

The completion network takes low-resolution NO2 images that contain the gaps to be filled, and a 
mask channel that indicates which pixels are missing. Each damaged patch has its corresponding high-resolution 
RGB image that covers the same region and provides gap-free visual information. Two networks 
were trained. The reducer acts as a down-sampler that intelligently resizes the high-resolution RGB image 
(the prior) to the same size as the damaged image. Table 2 presents its layers in successive order. The 
completor network is a fully convolutional network (FCN) that acts as the main inpainter. It is optimized to 
fill the missing gaps in the input image. Table 3 specifies its layers. The activations for both networks were 
passed through batch normalization and ReLU after each layer. 


Table 2. Reducer network layers 

N° | Type  | Kernel | Stride | Output channels
1  | Conv. | 5      | 1      | 16
2  | Conv. | 3      | 2      | 32
3  | Conv. | 3      | 1      | 64
4  | Conv. | 3      | 2      | 32
5  | Conv. | 3      | 1      | 1
Table 3. Completor network layers 

N° | Type    | Kernel | Stride | Dilation | Output channels
1  | Conv.   | 5      | 1      | 1        | 16
2  | Conv.   | 3      | 2      | 1        | 32
3  | Conv.   | 3      | 1      | 1        | 64
4  | Conv.   | 3      | 2      | 1        | 64
5  | Diconv. | 3      | 1      | 2        | 64
6  | Diconv. | 3      | 1      | 3        | 64
7  | Conv.   | 3      | 1      | 1        | 64
8  | Conv.^T | 4      | 2      | 1        | 32
9  | Conv.   | 3      | 1      | 1        | 32
10 | Conv.^T | 4      | 2      | 1        | 16
11 | Conv.   | 3      | 1      | 1        | 8
12 | Conv.   | 3      | 1      | 1        | 1


The completor network is used to inpaint the missing regions in the input. On the other hand, the 
reducer network resizes the prior to the same size as the input. Without it, the model would learn the 
unconditioned pollution image distribution, which is not optimal for location-variant patterns. Urban, land, 
and other visual features serve as strong priors for the completor to generate accurate patches. The completor 
network was trained using mean squared error loss (MSE) averaged over the masked (gap) pixels. 
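
The layer lists in Tables 2 and 3 can be sanity-checked by tracing spatial sizes through both networks. The sketch below is hedged on two assumptions that the tables do not state: every layer uses "same"-style padding (so only the stride changes the size), and layers 8 and 10 of the completor are transposed convolutions that up-sample by their stride. The names `REDUCER`, `COMPLETOR`, and `trace` are illustrative.

```python
import math

# (layer kind, stride) in table order; kernel sizes and channel counts do not
# affect the spatial size under the "same" padding assumed here.
REDUCER = [("conv", 1), ("conv", 2), ("conv", 1), ("conv", 2), ("conv", 1)]
COMPLETOR = [("conv", 1), ("conv", 2), ("conv", 1), ("conv", 2),
             ("diconv", 1), ("diconv", 1), ("conv", 1),
             ("deconv", 2), ("conv", 1), ("deconv", 2),
             ("conv", 1), ("conv", 1)]

def trace(size, layers):
    """Return the spatial size after each layer."""
    sizes = []
    for kind, stride in layers:
        # Transposed convolutions multiply the size; the others divide it.
        size = size * stride if kind == "deconv" else math.ceil(size / stride)
        sizes.append(size)
    return sizes

reducer_sizes = trace(256, REDUCER)      # 256 x 256 prior
completor_sizes = trace(64, COMPLETOR)   # 64 x 64 pollution patch
```

Under these assumptions the reducer maps a 256 x 256 prior to 64 x 64, matching the patch size, and the completor returns to 64 x 64 after a 16 x 16 bottleneck where the dilated layers operate.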


3.4. Discriminator network 

The discriminator network is trained to detect completed NO2 images. A ResNet-18 [30] 
architecture, shown in Figure 1, is used to extract a feature vector, which is mapped to the probability of the 
input being completed (fake) or real. Reducer network weights are frozen while optimizing the discriminator 
since the reducer is optimized for efficient inpainting, not discrimination. The primary role of RGB images in 
the context of the discriminator is to provide a useful prior that is independent of whether the input image is 
real or not. Hence, the discriminator is optimized to estimate P(input = completed|p). 


3.5. Training 

The completor network is denoted C(x, p), where x represents a batch of NO2 images with masks $M_c$ 
composed of 0s and 1s, with 1s marking the pixels that are missing in x, and p the priors for each image 
in x, i.e., the RGB images covering the corresponding regions of interest (ROIs). Similarly, D(X, p) designates the 
discriminator network, where X represents a pollution image (real or completed) and p the encoded prior over the 
ROI of X. 

Mean squared error loss ($\mathcal{L}_2$) is an inpainting loss that tends to produce blurred estimates over the 
gaps. It averages the squared differences between gap pixel predictions and targets, as in (5). 


$$\mathcal{L}_2(x, p, C) = \| M_c \odot (C(x, p) - x) \|^2 \tag{5}$$

On the other hand, the adversarial loss can be formulated as (6). 

$$\min_C \max_D \mathbb{E}[\log D(x|p) + \log(1 - D(C(x|p)|p))] \tag{6}$$


Here, C is the completor network, D is the discriminator, x is the input, X is the image passed to the discriminator (healthy or completed), and p the 
prior. The $\mathcal{L}_2$ and adversarial losses are combined to formalize the general optimization problem as (7). 


$$\min_C \max_D \mathbb{E}[\mathcal{L}_2(x, p, C) + \log D(x|p) + \log(1 - D(C(x|p)|p))] \tag{7}$$
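
To make the combined objective concrete, the completor-side terms of (7) can be evaluated on a toy patch. This is a sketch under stated assumptions: the squared error is averaged over gap pixels (following the textual description of the loss), the discriminator output is supplied as a plain number, and the function names are hypothetical.

```python
import numpy as np

def masked_l2(completed, target, mask):
    """Squared error of (5), averaged over gap pixels (mask == 1)."""
    gap = mask.astype(bool)
    return float(np.mean((completed[gap] - target[gap]) ** 2))

def completor_objective(completed, target, mask, d_completed, eps=1e-12):
    """Terms of (7) that depend on C: the L2 term plus log(1 - D(completed | p))."""
    d = np.clip(d_completed, eps, 1.0 - eps)
    return masked_l2(completed, target, mask) + float(np.mean(np.log(1.0 - d)))

target = np.full((4, 4), 0.5)
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1                                   # 2 x 2 gap
completed = target.copy()
completed[1:3, 1:3] = 0.7                            # completor is off by 0.2 in the gap
l2 = masked_l2(completed, target, mask)              # 0.2 squared
joint = completor_objective(completed, target, mask, np.array([0.5]))
```

The completor lowers this quantity both by reducing the reconstruction error inside the gap and by raising the discriminator's belief that the completed patch is real.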


GANs are challenging to train due to the instability between the generator and discriminator 
networks in the early training phase. For this reason, the training loop is balanced as described in 
Algorithm 1. 

The method proposed in [26] is chosen as the neural baseline. Its model was trained to produce 
visually appealing completions for a variety of natural scene images. It serves as a good benchmark because 
the proposed architecture is an extension of the baseline's modular design. 
Algorithm 1: $T_C$ and $T_D$ serve to pre-train each network separately before conducting adversarial training 


Data: $X$, $M_c$, $M_d$, $p$ 
Hyperparameters: $T_{max} = 10{,}000$, $T_C = 100$, $T_D = 100$, $\eta = 0.001$, $\eta' = 0.01$ 
Begin 
  For $t_i \in [0, T_{max}]$ do 
    Sample minibatch $x \subset X$ 
    Generate completor masks $M_c$ for $\forall x_i \in x$ 
    Generate discriminator masks $M_d$ for $\forall x_i \in x$ 
    If $t_i < T_{max} - T_C$ then 
      $W_C \leftarrow W_C - \eta \nabla_{W_C} \mathcal{L}_2(x, p, C)$ 
    Else if $t_i < T_{max} - T_D$ then 
      $W_D \leftarrow W_D - \eta' \nabla_{W_D} \mathrm{BCE}(x, p, D)$ 
    Else 
      Update $W_C$ with joint loss 
      Update $W_D$ with binary cross entropy loss 
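
The branching in Algorithm 1 amounts to a three-phase schedule: completor-only updates, then discriminator-only updates, then joint adversarial updates. The sketch below only selects the phase for a given iteration. The function name, the phase labels, and the threshold values in the usage line are hypothetical; the two thresholds are passed in explicitly rather than derived from the hyperparameters.

```python
def training_phase(t, completor_until, discriminator_until):
    """Return which update(s) to run at iteration t."""
    if t < completor_until:
        return "completor_l2"          # update W_C with the L2 loss only
    if t < discriminator_until:
        return "discriminator_bce"     # update W_D with binary cross-entropy only
    return "adversarial"               # update W_C (joint loss) and W_D (BCE)

# Illustrative thresholds chosen only to make the three phases visible.
schedule = [training_phase(t, completor_until=3, discriminator_until=5)
            for t in range(7)]
```

Pre-training each network before the adversarial phase is what keeps the early optimization balanced: neither network starts the minimax game against an untrained opponent.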


3.6. Data 

The European Organisation for the Exploitation of Meteorological Satellites (EUMETSAT) is an 
international satellite agency responsible for acquiring, preprocessing, and distributing reliable weather, 
climate, and environmental data. Its low-orbiting satellite, MetOp, continuously delivers critical climate data. 
EUMETSAT also distributes data from other partners such as the National Oceanic and Atmospheric 
Administration (NOAA). EUMETSAT's offline data products are free and available for research purposes. 

MetOp is a series of three polar-orbiting meteorological satellites developed by the European Space 
Agency (ESA) and operated by EUMETSAT. MetOp takes roughly 100 minutes to orbit the Earth, completing about 14 orbits a 
day. Having three satellites enhances the temporal resolution of MetOp as a data provider. MetOp carries a 
payload of 11 scientific instruments. After the data are transferred, they are preprocessed into multiple levels and 
fed into numerical simulators for weather forecasting and environmental monitoring. Many of its available 
data products provide vertical density measurements. 

Gas traces were acquired from the Global Ozone Monitoring Experiment-2 (GOME-2) instrument, a 
scanning spectrometer that provides global monitoring coverage. The near-real-time total column (NTO) 
product provides concentration measurements for four types of atmospheric trace gases: O3, NO2, SO2, and 
HCHO. The product has been operational since 01/12/2007, has a spectral resolution of 0.26-0.51 nm, and 
provides global geographic coverage with a spatial resolution of 40 x 40 km (MetOp-A). 

On the other hand, Meteosat is a series of geostationary meteorological satellites operated by 
EUMETSAT. Meteosat Second Generation (MSG) provides images of the full Earth disc and data for weather 
forecasts. It has a temporal resolution of 15 minutes. The Spinning Enhanced Visible and Infrared Imager 
(SEVIRI) instrument captures the true-color images at a spatial resolution of 1 x 1 km. The prior's (p) 
imagery is collected and preprocessed from MSG's SEVIRI instrument. 

MSG covers the ROI of Morocco. The tiles were clipped using the region of interest and merged 
through pixel-averaging to store a single (mosaic) high-quality image over the ROI. 

In this study, Morocco was chosen as the region of interest. All images were filtered to be in the 
bounding box [(-5.39,35.54), (-5.34,35.54), (-5.34,35.59), (-5.39,35.59), (-5.39,35.54)] in (longitude, 
latitude) coordinates. All SEVIRI tiles were taken from 06/2017 to 01/2018. Pollution tiles 
were acquired and filtered for the same ROI over the period from 03/2018 to 09/2018. The priors were sampled 
from an earlier timeframe because they will be used to inpaint future patches. Synthetic mask generation is 
explained in the "Results and discussion" section. 

Let T denote the set of acquired tiles for the region of interest. Each $t_i \in T$ is processed as 
follows: 
—  $t_i$ is projected onto a pre-defined static spatial grid to normalize pixel positions. 

—  $t_i$ is split into N 64 x 64 patches, denoted $X_i$. 

—  For each image $x_{ij} \in X_i$, its 256 x 256 RGB prior ($p_{ij}$) is collected. It covers the same region as $x_{ij}$. 

—  The pairs $\{x_{ij}, p_{ij}\}$ are split into two sets. The first contains healthy pollution patches (all pixel values are 
available). The second set holds the damaged patches. 

— Pixel values were normalized to the range [0,1] for both input patches $x_{ij}$ and priors $p_{ij}$. 
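
The patching, healthy/damaged separation, and normalization steps above can be sketched with NumPy. Grid projection and prior collection are omitted. The sketch assumes that missing pixels are encoded as NaN and that dataset-wide bounds `lo` and `hi` are known; these choices, the function names, and the synthetic tile are illustrative only.

```python
import numpy as np

def split_into_patches(tile, size=64):
    """Cut a 2-D tile into non-overlapping size x size patches (edges dropped)."""
    rows, cols = tile.shape[0] // size, tile.shape[1] // size
    trimmed = tile[:rows * size, :cols * size]
    return (trimmed.reshape(rows, size, cols, size)
                   .swapaxes(1, 2)
                   .reshape(rows * cols, size, size))

def min_max_normalize(patches, lo, hi):
    """Rescale values to [0, 1]; NaN (missing) pixels stay NaN."""
    return np.clip((patches - lo) / (hi - lo), 0.0, 1.0)

def split_healthy_damaged(patches):
    """Healthy patches have every pixel available; damaged ones contain NaN."""
    damaged = np.isnan(patches).any(axis=(1, 2))
    return patches[~damaged], patches[damaged]

tile = np.random.default_rng(0).uniform(0.0, 2.5, size=(128, 192))
tile[10, 10] = np.nan                        # one missing pixel in the first patch
patches = min_max_normalize(split_into_patches(tile), lo=0.0, hi=2.5)
healthy, damaged = split_healthy_damaged(patches)
```

Only the healthy set can provide ground-truth targets, which is why it is the one later damaged synthetically for training.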

The final data set was capped at 1M NO2 patches. 90% of the data set was used to train the model and the 
remaining 100,000 images for testing. A time-series train/test split was conducted where all of the 10% test 
samples came from a future time interval. 
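
A time-series split of this kind can be sketched by sorting samples on their acquisition time and cutting at the 90% mark. The helper name and the toy timestamps below are hypothetical.

```python
import numpy as np

def chronological_split(timestamps, train_fraction=0.9):
    """Return (train_idx, test_idx) such that no test sample precedes a training one."""
    order = np.argsort(timestamps, kind="stable")
    cut = int(round(train_fraction * len(order)))
    return order[:cut], order[cut:]

# Illustrative acquisition times (days since the first tile), one per patch.
times = np.array([5, 1, 9, 3, 7, 2, 8, 0, 6, 4])
train_idx, test_idx = chronological_split(times)
```

Unlike a random split, this prevents patches from the test interval leaking into training, so the reported errors reflect performance on genuinely future acquisitions.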


Int J Artif Intell, Vol. 10, No. 1, March 2021: 121 — 130 


Int J Artif Intell ISSN: 2252-8938 


4. RESULTS AND DISCUSSION 
4.1. Noise generation 

As opposed to the task of natural image inpainting, where not much consideration is given to the 
shape of the gap mask, the geometry of missing regions in satellite imagery follows strong patterns (clouds, 
holes, and lines). The mask generator should reproduce these patterns at training time. As a result, a separate 
GAN was trained using the isolated noised patches to learn the natural noise distribution. Its generator serves as a 
noise mask generator that is used to sample artificial gaps during training. For every healthy patch $x_i \in X$, a 
64 x 64 gap mask is sampled from this generator, and a copy of $x_i$ is purposely damaged using the mask. The mask is 
stacked on top of the damaged image; the final 2-channel image represents the input. The original image $x_i$ 
becomes the target used in loss calculation. 
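
The construction of one training pair can be sketched as follows. A rectangular gap stands in for a mask sampled from the learned noise generator, and filling gap pixels with 0 is an assumption; the function name is hypothetical.

```python
import numpy as np

def make_training_pair(patch, mask, fill_value=0.0):
    """Damage a healthy patch with a gap mask (1 = missing) and stack the mask.

    Returns (model_input, target): model_input has shape (2, H, W) and the
    target is the untouched patch.
    """
    damaged = np.where(mask == 1, fill_value, patch)
    return np.stack([damaged, mask.astype(patch.dtype)]), patch

rng = np.random.default_rng(0)
patch = rng.uniform(0.0, 1.0, size=(64, 64))
# Stand-in for a sampled noise mask: a single rectangular gap.
mask = np.zeros((64, 64), dtype=np.uint8)
mask[20:40, 10:50] = 1
model_input, target = make_training_pair(patch, mask)
```

Keeping the mask as a second channel tells the network explicitly which pixels to trust and which to fill.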


4.2. Results 

The proposed model was benchmarked against two algorithms: PatchMatch [29], which represents 
the classical state-of-the-art method, and the LocalGlobal [26] neural inpainter. For training, input images of 
size 64 x 64 pixels and prior images of size 256 x 256 pixels were used as input to the model. The training 
and testing data sets were extracted using a time-series split. 

As opposed to natural scene completion, where MAE and MSE are not considered good metrics 
because many completions can be conceptually possible, sensor measurements in pollution images 
are unique targets. Hence, the model was evaluated using two regression metrics. The first is the least 
absolute deviations loss, computed over the N evaluated patches as in (8). 


$$\mathcal{L}_1 = \frac{1}{N} \sum_{i=1}^{N} | M_c \odot (C(x_i, p_i) - x_i) | \tag{8}$$


$\mathcal{L}_1$ averages the absolute differences between the target and predicted gap pixels. Mean squared 
error ($\mathcal{L}_2$) is also reported. Both metrics measure error as the distance between the model's predictions and 
the ground-truth normalized targets. 
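
Both metrics can be computed over gap pixels only, as sketched below. Pooling all gap pixels of the batch into a single average is one possible reading of the normalization and is an assumption here, as is the function name.

```python
import numpy as np

def gap_errors(predictions, targets, masks):
    """Mean absolute (L1) and mean squared (L2) error over gap pixels only."""
    gap = masks.astype(bool)
    diff = predictions[gap] - targets[gap]
    return float(np.mean(np.abs(diff))), float(np.mean(diff ** 2))

targets = np.zeros((2, 4, 4))
predictions = np.zeros((2, 4, 4))
masks = np.zeros((2, 4, 4))
masks[:, :2, :2] = 1                 # four gap pixels per patch
predictions[0, :2, :2] = 0.5         # errors inside the gap are counted
predictions[1, 3, 3] = 9.0           # error outside the gap is ignored
l1, l2 = gap_errors(predictions, targets, masks)
```

Restricting the average to the mask means that pixels the sensor actually observed never dilute or inflate the reported error.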

The model achieves a least absolute deviations error of $\mathcal{L}_1 = 0.33$ (21% and 60% decrease in error 
with respect to the two baselines) and an MSE of $\mathcal{L}_2 = 0.15$ (29% and 73% decrease in error, respectively) 
over the test set. These results represent a significant performance increase over the PatchMatch and 
LocalGlobal inpainters. When converted to Dobson units (DU), we get $\mathcal{L}_1 = 0.56$ DU and $\mathcal{L}_2 = 0.41$ DU. 
One Dobson unit corresponds to a column density of approximately $2.69 \times 10^{16}$ molecules cm$^{-2}$; the raw measurements range between 0 
and 2.5 DU. 


4.3. Discussion 

The main contributor to the increase in performance is the conditional layer. An ablation study was 
conducted by removing the conditional layer and training the GAN without visual priors. The resulting $\mathcal{L}_1$ 
and $\mathcal{L}_2$ scores were slightly worse than those of LocalGlobal [26] (last column of Table 4). 


Table 4. Benchmarks of mean $\mathcal{L}_1$ and mean squared ($\mathcal{L}_2$) loss on the test set 

Metrics\Methods | PatchMatch [29] | LocalGlobal [26] | Ours | Ours (no priors)
$\mathcal{L}_1$ | 0.83            | 0.42             | 0.33 | 0.44
$\mathcal{L}_2$ | 0.55            | 0.27             | 0.15 | 0.28


Figure 2 showcases the effects that visual priors can have on pollution predictions. The model 
predicts higher NO2 concentrations in the city of Casablanca without processing high-density neighboring 
pixels (top row in Figure 2). It predicted high NO2 concentrations by relying solely on the RGB prior. Such 
patterns are noticeable in other urban and industrial cities in Morocco. However, in the bottom row, the 
model failed to predict a high pollution concentration over the region of Taourirt. That could have been the 
result of the temporal nature of industrial activities and the noise that is inherent to sensory measurements. 
The average increase in performance indicates that the model learned to associate certain visual features with 
high/low pollution densities. This study showcases the benefits of multi-modal learning in computer vision. 


Figure 2. Sample: normalized predictions over northern Morocco. From left to right: prior, input image, 
generated image, and ground-truth image (higher values represent greater concentrations) 


Despite the prior not providing population, time-dependent industrial activity estimates, or traffic 
information, it increased the overall performance by simply providing high-quality visual information. The 
reducer also played a critical role in transforming the priors in a way that was useful to inpaint the missing 
regions. The proposed model can also be used as a data enhancer for downstream tasks by removing cloud 
effects. Downstream applications include object detection [31], land cover generation [32], land use 
classification [33], change detection, crop monitoring, and field and urban mapping, among others. 


5. CONCLUSION 

This paper proposed a neural image inpainter capable of filling measurement gaps using 
neighboring pixel values and static priors. It described the preprocessing data pipeline, the CNN-based sub- 
networks, and the training process. The neural system was successfully trained to fill pollution gaps in the 
region of Morocco. It outperformed two prior-less baselines and showed the potential of data fusion for 
satellite image inpainting. The described neural system can be deployed within a remote sensing pipeline to 
fill incoming satellite patches, resulting in greater near-real-time visibility over weather, atmospheric, and 
climate conditions. The system intelligently nullifies the effects of sensor perturbations and cloud effects. 
However, one limitation of the system is that temporal resolution enhancement is not enough to offer a 
competitive alternative to IoT-based devices for small-scale monitoring; the limited spatial resolution of 
satellite sensors remains the most important open challenge in remote sensing. Nevertheless, the model can be 
modified to tackle super-resolution through prior learning. Such a system can enhance the satellite's spatial 
resolution by using low-resolution source imagery and high-resolution static priors. 